FIELD OF INVENTION

The invention relates to a packet switching device for performing real-time bandwidth 
provisioning. In particular, the invention relates to an apparatus and method for 
implementing an internal label used to measure and correct various quality of service 
parameters including queue delay between an ingress blade and an egress blade of a
switching device. 

BACKGROUND

Switches, including Ethernet switches, generally include numerous ports through which
the switch receives and transmits data packets. These switches commonly include a
plurality of switch modules with packet processors that operate at layers 2 and 3 of the
Open Systems Interconnect (OSI) model but are capable of providing some layer 4
through 7 functionality depending on the configuration. Operably coupling the various
switch modules is a back plane comprising a switching fabric that provides a
circuit-switched path linking each switch module to every other switch module. The
switching fabric of the back plane is a store and forward device capable of storing
packets until ready for output.

The bandwidth available to transfer packets between switching modules is limited. In
order to regulate access to the switch fabric among the competing queues, packets are
buffered at each of the ingress switching modules until a scheduler releases each of the
packets from its queue. The queue memory in which the ingress packets are temporarily
stored is generally categorized into a plurality of priority levels to provide higher levels
of service to select traffic. In principle, the higher priority traffic is serviced prior to
lower priority traffic, and time-critical flows are transmitted through the switching fabric
before less-critical traffic.

In practice, there are numerous bandwidth allocation schemes for implementing queue
prioritization, each with its own particular trade-offs. In strict priority, for example, a
lower priority queue is only dequeued after all higher priority queues are empty, which
can completely starve the lowest level queues. In weighted fair queuing, each of the
queues is assigned a weight indicating its relative importance to the other queues.
Queues are then dequeued in a round robin fashion with each queue being allotted a
percentage of the available bandwidth in proportion to its particular weight. In this
manner, each of the queues is serviced with emphasis given to the highest priority
queues.



As an unintended consequence of weighted fair queuing, some types of traffic in one or
more queues can effectively exceed the prescribed upper bandwidth limits associated
with a queue and effectively starve, albeit temporarily, other queues of their requisite
bandwidth. A lower priority queue can starve the highest priority queue of bandwidth,
for example, if the lower priority queue is permitted to monopolize the bandwidth by
transmitting one or more relatively large packets to the switch fabric. Under these
circumstances, the switching device may fall short of real-time guarantees, resulting in
increased delay and jitter of high priority traffic. There is, therefore, a need for a
switching device adapted to perform real-time traffic engineering on inter-blade traffic
flows.

SUMMARY 

The present invention features a method and apparatus for provisioning bandwidth
among a plurality of queues in a switching device. The bandwidth provisioning method
preferably comprises the steps of appending a QoS label comprising a timestamp to a
PDU segment, either an inbound PDU or a fragment of the inbound PDU, at a first
switching device; buffering the PDU segment in one of the plurality of queues;
conveying the PDU segment to a second switching device; determining the delay for the
PDU segment to propagate between the first switching device and the second switching
device; and altering at least one of the one or more queueing properties at one or more
queues depending on the delay observed. A PDU fragment is a fractional portion of a
PDU that is generated by parsing an inbound PDU. In the preferred embodiment,
fragmentation is used to prevent a PDU or a flow, for example, from monopolizing the
queue resources necessary to transmit PDUs from an ingress switching device to an
egress switching device, thereby making bandwidth available to other priority queues.

The queueing properties, in the preferred embodiment, are used to indicate whether to
enable subsequent PDUs or traffic flows to be fragmented for the purpose of adjusting the
allocation of bandwidth necessary to communicate PDUs between the first switching
device and a second switching device. The queueing properties may also be used to
determine the length into which PDUs are fragmented, thereby offering a means to make
fine adjustments to the bandwidth allocation scheme.



The bandwidth provisioning method preferably includes further steps for removing the
QoS label at the second switching device and reassembling the plurality of PDU
fragments into at least one protocol data unit (PDU) at the second switching device.

The bandwidth provisioning apparatus of the preferred embodiment comprises a first
switching device, comprising a plurality of queues with associated queueing properties,
for appending a timestamp to one or more PDU segments at the first switching device; and
a second switching device, operatively coupled to the first switching device, for altering
the length of one or more PDU segments buffered at one or more queues of the plurality of
queues depending on the time for the one or more PDU segments to propagate between
the first switching device and the second switching device.

BRIEF DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example and not limitation in the figures of 
the accompanying drawings, and in which: 

FIG. 1 is a functional block diagram of a bandwidth-provisioning switching device,
according to the preferred embodiment of the present invention;

FIG. 2 is a functional block diagram of a switching module, according to the preferred
embodiment of the present invention;

FIG. 3 is a flow chart of the method of performing real-time bandwidth-provisioning at
an ingress switching module, according to the preferred embodiment of the present
invention;

FIG. 4 is a flow chart of the method of performing real-time bandwidth-provisioning at
an egress switching module, according to the preferred embodiment of the present
invention;

FIG. 5 is a flow chart of the method of evaluating the queue delay, according to the
preferred embodiment;

FIG. 6A is an exemplary inter-module header used to label PDUs and PDU fragments
between switching devices, according to the preferred embodiment of the present
invention;

FIG. 6B is a distribution identifier, according to the preferred embodiment of the present
invention; and

FIG. 6C is a multiplexor identifier, according to the preferred embodiment of the present 
invention. 

DETAILED DESCRIPTION 

Illustrated in FIG. 1 is a functional block diagram of an enterprise switch comprising a
system of switch ports and a switch fabric. The enterprise switch 100 is one of a plurality
of nodes and other addressable entities operatively coupled to a communications network
such as an Internet Protocol (IP) network embodied in a local area network (LAN), wide
area network (WAN), or metropolitan area network (MAN), for example. The enterprise
switch 100 preferably comprises a plurality of switching devices 140, 142 operatively
coupled to one another by means of a switch fabric 150. A switching device may take
the form of a switch preprocessor, switch postprocessor, or blade embodied in a module
adapted to engage a slot in the enterprise switch chassis that operatively couples the blade
to a backplane 152.

Each of the plurality of switching devices 140-142, or blades, preferably comprises one
or more network processors 102-106 generally capable of, but not limited to, at least
layer 2 and layer 3 switching operations as defined in the OSI network model. Each of
the blades 140, 142 is adapted to transmit and receive packet data to and from the
network via communications links (not shown) and to and from one another by means of
the switch fabric 150.

For purposes of this application, data flowing into a blade, i.e. a switch module, from a
communications link toward the fabric 150 is referred to herein as ingress traffic
comprising ingress protocol data units (PDUs), and the switching device through which
ingress data propagates is generally referred to as the ingress switching device.
Furthermore, data flowing from the fabric 150 to a communications link is referred to as
egress traffic comprising egress PDUs, and the switching device through which it
propagates is referred to as an egress switching device. Each of the plurality of switching
devices of the present embodiment can serve as both an ingress switching device and an
egress switching device depending on the direction of the traffic.

In the preferred embodiment, the switching device is an IEEE 802.3-enabled switch
employing one or more media access control (MAC) devices 108-112 and one or more
physical layer receivers/transceivers 114-118 operably coupling the enterprise switch 100
to a plurality of wired and/or wireless communications links including, for example, a
WAN/LAN optical transceiver 122 by means of the SONET framer 120.

In the preferred embodiment, the enterprise switch 100 further includes a central
command processor 126 for monitoring and managing various system resources
including a configuration database storage (CDS) 128 and statistics and accounting
storage (SAS) 130. The central command processor 126 preferably resides on one of the
plurality of switching devices.

Illustrated in FIG. 2 is a functional block diagram of a representative switching device.
The switching device 140 comprises a plurality of network interface modules (NIMs)
202, 204, one or more network processors 102, and a fabric interface module 208. Each
of the NIMs 202, 204 is operatively coupled to an external port for purposes of receiving
and transmitting data traffic. The NIMs are preferably Ethernet-enabled ports comprising
one or more physical interfaces 114 and one or more media access control (MAC)
interfaces 108. Both ingress and egress PDUs are then exchanged between the blade 140
and the plurality of NIMs 202, 204 by means of one or more internal data busses 206.

The network processor 102 in the preferred embodiment comprises a management
module 220, a routing engine 230, and a queue manager 240. The management module
220 generally comprises a policy manager 222 for retaining and implementing policy
rules, including QoS policy, provided by the central command processor (CCP) 126
and/or a network administrator via the configuration manager 224. An internal copy of
the policy rules is preferably retained in high speed look-up cache 212 for purposes of
providing real-time support for the routing engine 230 operating at wire speed.



The routing engine 230 of the preferred embodiment is adapted to receive ingress data
from the NIMs 202, 204, parse the data, perform address look up from cache 212, and
process the individual PDUs for either layer 2 switching or layer 3 routing, for example,
prior to forwarding the PDU to the queue manager 240. The queue manager 240
preferably prioritizes and buffers the ingress traffic in ingress memory 242 prior to
forwarding it to the fabric interface module 208. The ingress memory 242 comprises a
plurality of queues of differing priority for purposes of providing class of service
(CoS)/quality of service (QoS). In some embodiments, the switching module 140 further
includes an ingress policer 250 for selectively filtering data prior to being enqueued at
ingress queue memory 242 and/or an ingress shaper 252 for selectively filtering data
prior to being forwarded to the switch fabric 150.

In this embodiment, the fabric interface module 208 also receives egress data from the
fabric 150 that is generally buffered in egress queue memory 248, conveyed through the
routing engine 230 for statistical processing, for example, and transmitted on the
appropriate egress port via one of the NIMs 202, 204. In some embodiments, the
switching module 140 further includes an egress policer 252 for selectively filtering data
prior to being enqueued in egress memory 248 and/or an egress shaper 256 for selectively
filtering data prior to being forwarded to the routing engine 230.

As discussed in more detail below, a PDU label is used to convey QoS properties,
including transit time information and fragmentation information, between the ingress
switching device and egress switching device to increase the efficiency and throughput of
the enterprise switch 100. In some embodiments, the QoS properties may further include
policing/shaping information with which the ingress switching devices may selectively
enable policing and/or shaping of the ingress or egress traffic streams.

In addition to the conventional switching and routing systems, the routing engine 230 of
the preferred embodiment further comprises a QoS manager 232, a fragmentation module
234, a label generator 236, a statistics manager 238, and an assembly module 239. The
QoS manager 232 oversees the fragmentation of select ingress PDUs and the labeling of
those PDUs for purposes of performing real-time bandwidth provisioning. This includes
tracking or otherwise monitoring one or more signals from the CCP 126 indicating when
to enable or modify the fragmentation operation, which of the one or more ingress queues
or ingress traffic flows on which to enable or modify fragmentation, and the
fragmentation parameters such as the maximum fragment size into which to divide a
PDU.

The fragmentation module 234 is preferably adapted to parse select ingress PDUs into
one or more PDU fragments that are forwarded to the fabric 150. PDU fragments have a
QoS label appended to them by the label generator 236 prior to being enqueued. While a
QoS label may also be appended to the unfragmented PDUs, the label applied to PDU
fragments may further include a fragment identifier or sequence number for purposes of
reconstructing or otherwise restoring the original packet at the egress switching device.
After the label generator 236, the PDU segments, including both labeled PDUs and
labeled PDU fragments, are transmitted to the queue manager 240 where they are
buffered and scheduled for output.

In general, the switching device 140 is adapted to enable fragmentation of PDUs
primarily when one or more queues are adversely impacted due to high bandwidth
consumption at one or more other ingress priority queues. Various metrics including
inter-module delay, packet size variation, traffic throughput, and packet queue depth, for
example, may be used to evaluate the real-time performance for every flow of traffic.
Although the degree to which ingress traffic is fragmented is an implementation issue
largely dependent on policy defined by the network administrator, a significant portion of
the ingress traffic is generally not fragmented by the enterprise switch 100.

The ingress PDUs and PDU fragments generated by the routing engine 230 are conveyed
to the queue manager 240 and buffered in one of a plurality of priority queues of ingress
queue memory 242. Each of the N priority queues of the ingress queue memory 242 is
associated with a different level of priority and correlates with a unique CoS or QoS
level. In the preferred embodiment, there are N=4 priority queues for each of the ingress
ports of the switching device 140, although this is subject to variation depending on the
application. PDUs and PDU fragments are enqueued using a prioritization scheme such
as strict priority or weighted fair queuing in a modified form discussed in more detail
below.

One skilled in the art will appreciate that the functional entities, including the
fragmentation module 234 and assembly module 239 for example, may be incorporated
into the queue manager 240 instead of the routing engine 230 while still preserving the
benefits of the present invention. One skilled in the art will also appreciate that the
routing engine 230 is one of a class of processing resources with which the present
invention may be practiced. Alternative processing resources include traffic classifiers,
rate policers, accounting devices, editing devices, and address look-up devices.

Illustrated in FIG. 3 is a flow chart of the method of performing real-time bandwidth
provisioning with an ingress switching device. In the preferred embodiment, the
real-time bandwidth-provisioning is implemented by fragmenting selected ingress PDUs on
an ingress switching device to prevent PDUs of a particular priority queue from
inadvertently consuming a disproportionately large amount of bandwidth at the expense
of other priority queues.

As indicated by the classification step 302, the policies 212 are applied and an ingress
PDU is assigned to one of the plurality of priority queues of ingress queue memory 242,
effectively prioritizing (step 303) the PDU for purposes of providing some QoS or CoS.
In some embodiments, the PDU or PDU fragment may be selectively filtered by the
ingress policer 250 prior to committing the PDU to memory 242. Whether the ingress
switching device 140 proceeds to parse the PDU into a plurality of PDU fragments, as
determined in fragmentation testing step 305, depends on whether fragmentation is
enabled by the CCP 126. The decision at the CCP 126 whether to enable fragmentation
on a particular priority queue or flow, for example, is preferably made in consideration of
systemic traffic patterns present at each of the switching devices coupled to the
switch fabric 150. In general, the decision to enable fragmentation is made in real-time
based upon a statistical feedback system monitoring the transit time from ingress blade to
egress blade. When enabled, fragmentation preferably occurs contemporaneously with or
subsequent to layer 2 and layer 3 processing of incoming packets from the one or more
ingress ports.

Each of the plurality of priority queues 1-N in ingress queue memory 242 is associated
with a different priority level, each priority level correlating with one or more CoS or
QoS levels. Each of the one or more QoS levels is defined by one or more policies
governing the transmission of PDUs through the node. The policies may set forth
bandwidth requirements, maximum jitter, queue delay, transit delay, and the preference
and frequency with which packets are distributed to the switch fabric 150, for example.

In the preferred embodiment, ingress PDUs are distributed to one of four priority level
queues per port per switching device. The highest priority queue is dedicated to the
highest class of service, which generally prescribes a minimum bandwidth and/or a
minimum queue delay, for example. The type of traffic serviced by the highest priority
queue generally includes voice communications and video transmissions, which require
minimal latency. The remaining priority queues have progressively lower levels of
priority corresponding to lower levels of service. PDUs and PDU fragments in the lowest
priority queue, the default queue, have no service guarantees and are distributed to the
switch fabric 150, in the case of strict priority queuing, only when the higher priority
queues are empty.

While the switch fabric 150 may also include memory for buffering PDUs being
transmitted from the ingress switching device to the egress switching device, the switch
fabric 150 does not give preferential treatment to any PDU, regardless of its priority. The
primary purpose of the switch fabric 150 is simply to deliver packets from one traffic
manager to another and to, perhaps, filter traffic when the fabric's internal queues reach
capacity. The filtering performed by the switch fabric 150 is, therefore, based on the
volume of traffic received rather than the class or priority of the traffic. The preferred
embodiment of the present invention overcomes this problem and other problems by
introducing a feedback mechanism accessible to all the ingress queue managers attached
to the same switching fabric 150.

In the absence of an excessive bandwidth consumption problem, the fragmentation module
234 is disabled and the fragmentation testing step 305 is answered in the negative. When
fragmentation is disabled, the ingress PDU is sent to the ingress queue 242 where a
QoS label comprising a time stamp is appended to it (step 306). The time stamp, which is
used to quantify the throughput of each of the queues, may be the actual time or some
standard used for reference internal to the switching device. The packet with QoS label is
then buffered (enqueuing step 308) at the priority queue determined in classifying step 302.

If the fragmentation module 234 is enabled by the CCP 126 for the PDU, the
fragmentation testing step 305 is answered in the affirmative and the PDU is parsed into
one or more PDU fragments in fragmenting step 310. Each of the selected PDUs is
divided in parsing step 310 into a plurality of PDU fragments. PDU fragmentation is
generally required when one or more queues associated with the particular ingress port
are being under-serviced. Since a PDU fragment is a smaller quantum of data than the
original PDU, fragmentation prevents a particular priority queue from monopolizing the
channel between the ingress queue 242 and the switch fabric 150 for the period of time
necessary to transmit the entire PDU. Fragmentation of PDUs of a particular priority
queue can, therefore, reduce the queue delay in other priority queues by making the
bandwidth to the fabric 150 available sooner than without fragmentation.

The selection of PDUs for fragmentation may be implemented using one or more
approaches either alone or in combination. In a first approach, fragmentation is enabled
on a per flow basis for one or more ingress flows. Fragmentation may be triggered, for
example, when one or more traffic flow metrics exceeds an associated threshold. In a
second approach, PDUs are selected for fragmentation on a packet-by-packet basis as a
function of various parameters including, for example, the PDU length or the current
level of congestion in one or more ingress priority queues 242. In a third approach,
fragmentation is enabled for a fractional packet, i.e. the remainder of a PDU still buffered
at the ingress queue memory 242 after transmission to the fabric 150 has begun but
before completion of transmission. In a fourth approach, fragmentation is enabled for all
ingress PDUs assigned to a congested priority queue at the ingress queue memory 242.
In a fifth approach, fragmentation is enabled for PDUs from lower priority queues that
are destined to congested switch fabric queues (not shown). In a sixth approach,
PDUs destined to a congested egress queue are fragmented to relieve congestion at the
destination egress switching device.

In the preferred embodiment, the one or more PDUs that are selected for fragmentation
are divided into a plurality of substantially equal-length PDU fragments. In general, the
PDU should be divided into the minimum number of PDU fragments necessary to
achieve a desired result. If the desired result is a maximum jitter, for example, the
maximum size PDU fragment must be small enough that the time required to output the
PDU fragment to the fabric 150 is less than the maximum delay of the higher priority
queue. If a high priority queue had a maximum queue delay of 10 microseconds, for
example, the maximum size fragment in a system with a 1 gigabit per second switch
fabric would be restricted to 1.25 kbytes. While the actual number and size of PDU
fragments may be determined on a per packet or per flow basis, the more congested the
network the smaller the fragmentation size, as a general rule.

A simple formula for calculating a suitable fragmentation size is as follows:

Current Fragmentation Size = (Maximum Frame Size) / Max[(Current Delay / Acceptable Delay), 1]

In this example, fragmentation is generally enabled when Current Delay is greater
than Acceptable Delay. One skilled in the art will recognize, however, that this is only
one of numerous ways for calculating an optimal PDU fragment length.
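
The formula above may be expressed, purely as a non-limiting sketch, in the following form. The function and parameter names are assumptions made for this example, and truncating the result to a whole number of bytes is likewise an assumption, since the specification does not state a rounding rule.

```python
def fragment_size(max_frame_size: int, current_delay: float,
                  acceptable_delay: float) -> int:
    """Current Fragmentation Size = Maximum Frame Size / max(Current/Acceptable, 1)."""
    if acceptable_delay <= 0:
        raise ValueError("acceptable_delay must be positive")
    # A ratio at or below 1 leaves the frame unfragmented (size == max frame size).
    ratio = max(current_delay / acceptable_delay, 1.0)
    return int(max_frame_size / ratio)
```

With a 1500-byte maximum frame, a current delay three times the acceptable delay yields 500-byte fragments, while any current delay at or below the acceptable delay leaves the size at 1500 bytes.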

In some embodiments, each of the PDU fragments is assigned a fragment identifier, such
as a sequence number, in the numbering step 313 to facilitate the reconstruction of the
fragmented PDU at the egress switching device. The sequence number or a pointer
thereto is preferably incorporated into a QoS label in conjunction with a time stamp
appended to the PDU fragment (step 314). The PDU fragments with QoS label are then
enqueued in the ingress queue memory 242 as illustrated in enqueuing step 316. In the
preferred embodiment, each of the fragments derived from a common PDU is assigned to
a common priority queue in enqueuing step 316 in sequential order from the start of the
packet to the end.
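
The fragmenting, numbering, and labeling steps may be sketched, by way of example only, as follows. The label layout shown here (a timestamp, a PDU identifier, a sequence number, and a fragment count) is a hypothetical arrangement chosen for illustration and is not the inter-module header of FIG. 6A; the names `QosLabel` and `fragment_pdu` are likewise assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QosLabel:
    """Hypothetical QoS label: time stamp plus fragment bookkeeping."""
    timestamp: float  # origination time at the ingress device
    pdu_id: int       # identifies the PDU a fragment belongs to (assumed field)
    seq: int          # sequence number of the fragment within the PDU
    total: int        # number of fragments the PDU was divided into (assumed field)


def fragment_pdu(payload: bytes, max_fragment: int, pdu_id: int,
                 now: float) -> List[Tuple[QosLabel, bytes]]:
    """Divide a PDU into the minimum number of substantially equal-length fragments."""
    if max_fragment <= 0:
        raise ValueError("max_fragment must be positive")
    # Minimum fragment count such that no fragment exceeds max_fragment.
    count = max(1, -(-len(payload) // max_fragment))
    base, extra = divmod(len(payload), count)
    segments = []
    offset = 0
    for seq in range(count):
        # The first `extra` fragments carry one additional byte.
        length = base + (1 if seq < extra else 0)
        label = QosLabel(timestamp=now, pdu_id=pdu_id, seq=seq, total=count)
        segments.append((label, payload[offset:offset + length]))
        offset += length
    return segments
```

A 10-byte PDU with a 4-byte maximum fragment size is divided into three fragments of 4, 3, and 3 bytes, labeled with sequence numbers 0 through 2 in order from the start of the packet to the end.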

As illustrated in the queue scheduling step 318, both PDUs and PDU fragments are then
distributed to the fabric 150 using some scheduling algorithm such as strict priority or
weighted fair queuing, for example. For purposes of the scheduling algorithm, there is no
distinction between PDUs and PDU fragments. As illustrated by return path 330, the
ingress blade 140 repeats the QoS provisioning process 300 for each successive PDU of
each ingress port destined for the switch fabric 150.

In some embodiments, an egress blade monitoring the QoS labels from the ingress blade
may also provide feedback with which the ingress blade can selectively filter (step 320)
the PDU or PDU fragment at the ingress shaper 252.

Illustrated in FIG. 4 is a flow chart of the method of performing real-time bandwidth
processing at an egress switching device. The switching device 140 in the preferred
embodiment, which is adapted to process both ingress and egress traffic, serves as both
an ingress switching device and an egress switching device depending on the PDU and
its direction of flow. For purposes of the discussion below, reference is made to the egress
switching device, blade 140, with the understanding that it is structurally identical to the
ingress switching device 140, but generally a different device than the ingress switching
device.

First, the egress switching device 140 receives a PDU segment from the switch fabric
150, as illustrated in egress PDU receiving step 402. A PDU segment as termed herein
includes both intact PDUs as well as PDU fragments, whether or not they include a
QoS label. Upon receipt, the egress switching device 140 conveys the egress PDU
segment to the statistical manager 238 of the routing engine 230 where the timestamp of
the QoS label is read. The statistical manager 238 is a computational entity that serves,
in part, to acquire some of the statistical figures necessary to decide whether to maintain
or alter the fragmentation enable/disable signal seen at one or more ingress
blades. The statistical manager 238 generally determines the relative time between
successive PDU segments, although it may also compare the current time from clock 214
to the timestamp in order to evaluate the queue delay (step 404). While the queue delay
is intended to represent the period of time that a PDU segment is buffered in the ingress
queue memory 242 on the ingress switching device, it may further include propagation
delay associated with the transmission through the switch fabric 150. Removal of the
QoS label (step 406) may occur prior to, subsequent to, or contemporaneously with the
queue delay evaluation (step 404).

If a particular PDU segment from the fabric 150 is an unfragmented PDU, the fragment
detecting step 408 is answered in the negative and the PDU is forwarded (step 414) through
the egress switching device 140 to the appropriate egress port. If, however, the PDU
segment is identified as a PDU fragment, and it is not filtered by the egress policer 409,
the PDU fragment is buffered (step 410) until each of the fragments of the original PDU
is received. Although the PDU fragments of a PDU are received in the order in
which they were transmitted, receipt of the PDU fragments may be interleaved with other
PDU segments corresponding to other priority queues and other ingress switching
devices.

Once each of the PDU fragments of a PDU is received, the PDU is reassembled (step
412) at the assembly module 239 and then forwarded to the appropriate egress port 202,
204 in the same manner as the unfragmented PDU. The reassembly of the PDU (step
412) in the preferred embodiment constitutes a complete restoration of the PDU, which is
then indistinguishable from the original PDU prior to fragmentation (step 310). The
egress QoS provisioning process 400 is repeated by way of a return path 422 for each
PDU segment from the switch fabric 150.
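
The buffering and reassembly steps 410 and 412 may be sketched, by way of example only, as follows. The class name `Reassembler`, the use of a caller-supplied key to distinguish interleaved PDUs, and the explicit fragment count are assumptions made for this illustration; the specification does not prescribe how fragments are matched to their original PDU.

```python
from typing import Dict, Hashable, Optional


class Reassembler:
    """Buffers PDU fragments until every fragment of a PDU has arrived."""

    def __init__(self) -> None:
        # key -> {sequence number: fragment bytes}
        self._pending: Dict[Hashable, Dict[int, bytes]] = {}

    def accept(self, key: Hashable, seq: int, total: int,
               chunk: bytes) -> Optional[bytes]:
        """Store one fragment; return the restored PDU once it is complete.

        key identifies the original PDU (for example an ingress device and
        PDU identifier pair), so fragments of different PDUs may interleave.
        """
        if not 0 <= seq < total:
            raise ValueError("sequence number out of range")
        parts = self._pending.setdefault(key, {})
        parts[seq] = chunk
        if len(parts) < total:
            return None  # still waiting for the remaining fragments
        del self._pending[key]
        # Concatenate in sequence order to restore the original PDU.
        return b"".join(parts[i] for i in range(total))
```

In this sketch, fragments belonging to two different PDUs may arrive interleaved; each call returns `None` until the final fragment of a given PDU is accepted, at which point the restored PDU is returned and its buffer is released.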



In some embodiments, the switching device is further adapted to filter the PDU at the
egress shaper 256 after the PDU is buffered at the egress queue memory 248 prior to
being transmitted from the egress port. The decision whether to filter the PDU may be
based in part on feedback timing information derived from the QoS label.

Illustrated in FIG. 5 is a flow chart of the method of evaluating the queue delay. As
illustrated in the first step, the time stamp extracted from the QoS label, i.e. QoS header,
is provided as input (step 502) to a statistical manager 238. The purpose of the statistical
manager 238 is to determine the queue delay from one or more ingress queues necessary
to assess the efficacy of the fragmentation scheme or schemes at one or more ingress blades.
The timestamp from the QoS header is referred to herein as the origination time, which
represents a common point of reference with which to measure the QoS performance.

Using the origination time, the statistical manager computes an observed queue delay
(OQD). The OQD as used herein comprises the actual delay necessary to buffer a PDU
or PDU fragment on the ingress switching device and transmit it to the egress switching
device. In some embodiments, the OQD is the difference between the origination time
and the time of receipt at the egress blade, and is computed for each individual PDU
segment. In other embodiments, however, the OQD is computed from a statistical
weighted average of a plurality of PDU segments received from a given ingress queue of
a given priority level within a predetermined observation interval. Irrespective of the
means of computation, an OQD is determined, in the preferred embodiment, for each
priority level of each of the ingress queues of each of the ingress switching devices so as
to provide a quantitative measure of the current delay experienced across the entire
enterprise switch 100.
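
By way of illustration only, the two OQD computations described above may be sketched as follows. The function names, the record layout, and the use of a caller-supplied weight per segment are assumptions made for this sketch; the specification does not prescribe how the weights of the statistical weighted average are chosen.

```python
from collections import defaultdict


def per_segment_oqd(origination_time, receipt_time):
    """OQD of one PDU segment: time of receipt at the egress blade
    minus the origination time carried in the QoS header."""
    return receipt_time - origination_time


def weighted_oqd(samples):
    """Weighted average OQD over one observation interval.

    samples: iterable of (origination_time, receipt_time, weight).
    The weight is a hypothetical parameter (e.g. segment length).
    Returns None when no weighted samples were observed.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for origination, receipt, weight in samples:
        weighted_sum += weight * per_segment_oqd(origination, receipt)
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


def oqd_by_queue(records):
    """One OQD per (ingress blade, ingress queue, priority level).

    records: iterable of
        (blade, queue, priority, origination_time, receipt_time, weight)
    """
    grouped = defaultdict(list)
    for blade, queue, priority, origination, receipt, weight in records:
        grouped[(blade, queue, priority)].append((origination, receipt, weight))
    return {key: weighted_oqd(values) for key, values in grouped.items()}
```

In this sketch the per-segment embodiment corresponds to calling per_segment_oqd directly, while the averaging embodiment collects the segments received within the observation interval and passes them to oqd_by_queue.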

In the alternative to the OQD, the switching device 140 in some embodiments generates a
queue delay expectancy (QDE). The QDE as used herein represents the expectation
value of the queue delay and is used as a measure of future queue delay. As with the
OQD, a QDE is computed in the preferred embodiment for each priority level of each of
the ingress queues of each of the ingress blades. The QDE is preferably computed from a
statistical weighted average of a plurality of PDU segments received from a given ingress
queue of a given priority level within a predetermined expectation interval.

In addition to the OQD, the switching device 140 also employs a target queue delay
(TQD) 507 representing the queue delay that one would observe if and when the queue
scheduling schema is able to accommodate current traffic conditions and meet or
exceed policy QoS objectives. A TQD in the preferred embodiment may be defined for
each priority level of an ingress queue memory 242. The TQD corresponds to the
maximum delay or latency that the given priority level should experience in order to
maintain the desired QoS as defined by the one or more policy rules provided by the
network administrator.

As illustrated in the comparison step 506, the OQD and TQD are compared for purposes
of determining whether to enable fragmentation at a particular priority queue of an
ingress blade. In some embodiments, the determination of whether to enable
fragmentation may be based on one or more additional estimators including the
bandwidth and delay generated as a function of the bandwidth and delay variation
observed for every flow using a moving average, low pass filter, or adaptive filter, for
example. A threshold can be defined for each of these estimators which, when exceeded,
will cause fragmentation to be enabled for an associated flow.
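
As a minimal sketch of one such estimator, the following assumes a first-order low pass filter (an exponentially weighted moving average) applied to per-flow delay samples. The class name, the smoothing factor alpha, and the threshold value are hypothetical; the specification names the filter families but does not fix a particular filter or its parameters.

```python
class LowPassEstimator:
    """First-order low pass filter over per-flow samples, with a threshold.

    alpha is an assumed smoothing factor in (0, 1]; larger values track
    recent samples more closely. threshold is the assumed per-estimator
    limit above which fragmentation would be enabled for the flow.
    """

    def __init__(self, alpha, threshold):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.threshold = threshold
        self.value = None  # no estimate until the first sample arrives

    def update(self, sample):
        """Fold one observed sample into the running estimate."""
        if self.value is None:
            self.value = sample
        else:
            self.value = self.alpha * sample + (1.0 - self.alpha) * self.value
        return self.value

    def exceeded(self):
        """True when the current estimate is above the threshold."""
        return self.value is not None and self.value > self.threshold
```

One estimator instance would be kept per flow and per measured quantity (bandwidth, delay, or delay variation), and fragmentation would be enabled for the associated flow once exceeded() returns True.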

If it is determined in the QoS assurance testing step 508 that the OQD for a particular
priority level is greater than the TQD, the switching device takes steps to mitigate the
queue delay. In the preferred embodiment, the switching device 140 uses the inter-blade
PDU fragmentation operation to reduce the bandwidth consumed by one or more ingress
queues. The fragmentation is generally applied to one or more priority level queues that
consume an excessive or disproportionate quantity of bandwidth at the ingress blade of
the priority level queue being "starved." The one or more offending queues are
determined in identifying step 510 for purposes of enabling the corresponding
fragmentation module.

If the fragmentation module is not enabled, as determined in fragmentation enabled
testing step 512, the ingress switching device is directly or indirectly signaled and the
fragmentation module enabled (step 514) for one or more priority queues. If the
fragmentation module was previously enabled for the one or more priority queues and the
target queue delay is still not satisfied, the fragmentation module in some embodiments is
adapted to further reduce the maximum fragmentation size (step 516). The size of the
PDU fragment may be reduced in one or more increments from a relatively high
maximum PDU fragment size until a relatively low maximum fragmentation size is
attained.

If the OQD is less than or equal to the TQD (step 508), the egress switching device 140
proceeds to determine in disable fragmentation testing step 518 whether it is necessary to
disable fragmentation for any priority queues for which the fragmentation module had
previously been enabled. If the OQD is substantially smaller than the TQD, for example, the
fragmentation module may be disabled (step 520) for one or more priority queues at
which fragmentation was implemented. In some embodiments, the maximum PDU
fragmentation size may be increased prior to entirely disabling the fragmentation module.
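
The decision flow of steps 508 through 520 may be sketched as follows for a single offending priority queue identified in step 510. This is an illustrative sketch only: the size limits, the step increment, and the disable_ratio used to interpret "substantially smaller" are hypothetical values that do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class FragmentationState:
    """Fragmentation settings of one priority queue (names are assumptions)."""
    enabled: bool = False
    max_fragment_size: int = 1024  # hypothetical size in bytes


def adjust_fragmentation(state, oqd, tqd, *, min_size=128, max_size=1024,
                         step=128, disable_ratio=0.5):
    """Apply one pass of the FIG. 5 decision flow to `state`."""
    if oqd > tqd:                                    # step 508: target missed
        if not state.enabled:                        # step 512
            state.enabled = True                     # step 514
            state.max_fragment_size = max_size
        else:                                        # step 516: shrink fragments
            state.max_fragment_size = max(
                min_size, state.max_fragment_size - step)
    elif state.enabled and oqd < disable_ratio * tqd:    # step 518
        if state.max_fragment_size < max_size:
            # Relax the fragment size before disabling entirely.
            state.max_fragment_size = min(
                max_size, state.max_fragment_size + step)
        else:
            state.enabled = False                    # step 520
    return state
```

Calling adjust_fragmentation once per observation interval reduces the maximum fragment size in increments while the target is missed, and reverses the process once the observed delay falls well below the target.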

Illustrated in FIG. 6A is an exemplary inter-module header used to label PDUs and PDU
fragments in transit between switching devices. The inter-module header, present only within
the switch 100, comprises an internal Ethernet header 620, an internal VLAN header 622,
and a distribution identifier 614. The inter-module header 600A is appended to a PDU
segment, i.e. a PDU or PDU fragment, during transit from an ingress switching device to
an egress switching device.

The internal Ethernet header 620 comprises an internal media access control (MAC)
source 604, internal MAC destination 602, and Ethernet type 606 used internally amongst
the switching module 140 and the one or more other switching modules operatively
coupled to the switching fabric 150. The internal Ethernet header 620 is preferably
stripped from the PDU prior to transmission from the egress port and prior to reassembly
of the PDU fragment.

The internal VLAN header 622 comprises a priority field, i.e. P-bits 608, correlated to the
class of service (CoS) used to distinguish traffic types, a canonical format indicator (CFI)
field 610, and a transit VLAN ID field 612 with a unique value to indicate the presence of
the distribution identifier 614.

The distribution identifier of the preferred embodiment, illustrated in more detail in FIG.
6B, comprises an ESC field 630, CoS field 632, DES field 634, and multiplexor identifier
638. The ESC field indicates the presence of one or more additional distribution
identifiers, thereby allowing more information to be embedded per frame. The CoS field
632 defines a class of service which the egress switching module may, but need not
necessarily, include in the outbound packet. The DES field 634, representing the discard
eligibility, indicates to the egress switching module that the particular segment may be
preferentially dropped by the inbound processor. Although the DES field 634 and CoS
field 632 are represented in what appear to be independent fields, they are usually highly
correlated in practice.

The multiplexor identifier 638, illustrated in detail in FIG. 6C, contains the core control
fields used by an ingress switch module to convey instructions for assembling fragmented
PDUs and used by an egress switch module to determine the inter-switch module transit
time. In particular, the multiplexor identifier 638 comprises a fragmentation operation
code 640 selected from a specific set of remote procedural primitives that define
instructions for reconstructing the PDU or PDU fragment at the egress switching module.
The multiplexor identifier 638 further comprises a fragment identifier 644 or pointer
thereto that identifies the fragment for purposes of reconstructing the original PDU, and a
timestamp 642 appended to the PDU segment at the ingress switching module.
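
For purposes of illustration, the multiplexor identifier 638 may be serialized and parsed as sketched below. The field widths (8-bit operation code, 32-bit timestamp, 16-bit fragment identifier), the field order, the network byte order, and the modulo-2^32 handling of timestamp wrap-around are all assumptions of this sketch; the specification does not state the bit layout of FIG. 6C.

```python
import struct

# Assumed layout: opcode (1 byte), timestamp (4 bytes), fragment id (2 bytes),
# big-endian with no padding. These widths are hypothetical.
_MUX_FORMAT = "!BIH"
MUX_SIZE = struct.calcsize(_MUX_FORMAT)


def pack_mux_id(opcode, timestamp, fragment_id):
    """Build the multiplexor identifier at the ingress switching module."""
    return struct.pack(_MUX_FORMAT, opcode, timestamp & 0xFFFFFFFF, fragment_id)


def unpack_mux_id(data):
    """Return (opcode, timestamp, fragment_id) at the egress switching module."""
    return struct.unpack(_MUX_FORMAT, data[:MUX_SIZE])


def transit_time(timestamp, receipt_time):
    """Inter-switch module transit time in timer ticks.

    Both values are treated as 32-bit counters, so the subtraction is taken
    modulo 2**32 to tolerate a single wrap-around of the counter.
    """
    return (receipt_time - timestamp) & 0xFFFFFFFF
```

The egress switching module would pass the unpacked timestamp and its own receipt time to transit_time, which assumes the ingress and egress blades share a common time base.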

One skilled in the art will appreciate that the contents of the inter-module header may 
also be incorporated into one or more test packets for purposes of providing signaling 
and/or control between switching devices. The test packets may be communicated
in-band with packet data or out-of-band using dedicated control lines.

Although the description above contains many specifications, these should not be 
construed as limiting the scope of the invention but as merely providing illustrations of 
some of the presently preferred embodiments of this invention. 

Therefore, the invention has been disclosed by way of example and not limitation, and 
reference should be made to the following claims to determine the scope of the present 
invention. 